In the two previous notebooks, we trained a sentiment analysis model and created an API to query live tweets, analyze them and store them in a NoSQL database. This pipeline ran during the France-Argentina match of 30/06/2018 during the 2018 World Cup. In this notebook, we will dig a bit deeper into those data and analyze several aspects.
import numpy as np
import json
import datetime
import tqdm
import seaborn as sns
from collections import Counter
from nltk.corpus import stopwords
import pandas as pd
from bson import json_util
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import matplotlib.dates as md
from matplotlib import dates
from wordcloud import WordCloud
import pymongo
from pymongo import MongoClient
With relatively little data and not being very comfortable with NoSQL databases, we will first convert everything into a DataFrame.
client = MongoClient('localhost', 27017)
db = client['Twitter_db']
collection_clean = db['tweets_clean']
print(json.dumps(collection_clean.find_one(), indent=4, default=json_util.default))
This is what is stored for every tweet. We have the complete text, tokens produced by a TweetTokenizer and a Stemmer from nltk, all hashtags, and the predicted sentiment. Now we can create an empty DataFrame to be filled with the data from MongoDB.
nb = collection_clean.count_documents({})
df = pd.DataFrame(index=range(nb), columns=['ID','Text', "Time",'Hashtags','Sentiment','tokens'])
for i, record in enumerate(collection_clean.find()):
    obj = {
        'ID': record["_id"],
        'Text': record["text"],
        'Time': record["time"],
        'Hashtags': "-".join(record["hashtags"]),
        'Sentiment': record["sentiment"],
        'tokens': "-".join(record["tokens"])
    }
    df.iloc[i, :] = obj
df = df.set_index("ID")
client.close()
df.info()
df.to_csv("F:/Twitter_data/dataset/fra_arg_full.csv", encoding="utf-8", sep="%")
Now we have converted our 112700 tweets to a DataFrame and saved it so that we don't have to repeat those steps later.
Next, we will explore the content from multiple angles, but first let's prepare the required data.
df = pd.read_csv("F:/Twitter_data/dataset/fra_arg_full.csv", encoding="utf-8", sep="%", index_col=0)
df.tokens = df.tokens.fillna("N/A")
df.Hashtags = df.Hashtags.fillna("N/A")
df['Time'] = pd.to_datetime(df['Time'])
df.head(5)
First, we can look at all tokens and their frequencies. To do so, we will remove the standard English stopwords as well as the two tokens created during preprocessing, "three_dot" and "exc_mark".
stopWords = stopwords.words('english')
stopWords += ['three_dot', 'exc_mark', "N/A"]
results = Counter()
df['tokens'].str.split("-").apply(results.update)
for word in stopWords:
    if word in results:
        del results[word]
results.most_common(50)
If we explore the result (I did the check on the top 1000 but display only the top 50 for readability), we can see that several players appear under multiple spellings. The worst case is Mbappe, whose name appears written in 5 different ways.
print("mbapp:", results["mbapp"])
print("mbappe:", results["mbappe"])
print("mbappé:", results["mbappé"])
print("bappe:", results["bappe"])
print("bappé:", results["bappé"])
print("kilian:", results["kilian"])
print("killian:", results["killian"])
Due to all the cleanup required, this step will be continued a bit later.
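As a rough sketch of the cleanup to come, spelling variants can be merged into a canonical token with a simple mapping (the variants below come from the frequency check above; the `normalize` helper and its name are illustrative, not the final implementation):

```python
from collections import Counter

# Map each observed spelling variant to one canonical token
# (variants taken from the frequency check above).
variants = {
    "mbapp": "mbappe", "mbappé": "mbappe",
    "bappe": "mbappe", "bappé": "mbappe",
    "killian": "kylian",
}

def normalize(counter, mapping):
    """Merge counts of variant spellings into their canonical form."""
    merged = Counter()
    for token, count in counter.items():
        merged[mapping.get(token, token)] += count
    return merged

counts = Counter({"mbapp": 10, "mbappé": 5, "mbappe": 3})
print(normalize(counts, variants))  # Counter({'mbappe': 18})
```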
We can do the same with hashtags. In that case, no cleaning is needed, and we can visualize the balance using a word cloud.
results_tag = Counter()
df['Hashtags'].str.split("-").apply(results_tag.update)
for word in stopWords:
    if word in results_tag:
        del results_tag[word]
wordcloud = WordCloud().generate_from_frequencies(results_tag)
plt.figure(figsize=(20,12))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
We can group the DataFrame by minute and, for now, just count the number of tweets. We can then plot the result and also mark some specific events (goals) to see their effect.
agg = df.Time.groupby([df.Time.dt.hour, df.Time.dt.minute]).agg(["min", "count"])
x = agg["min"].values
y = agg["count"].values
agg.head()
start_first = datetime.datetime(2018, 6, 30, 16, 0)
end_first = datetime.datetime(2018, 6, 30, 16, 47)
start_second = datetime.datetime(2018, 6, 30, 17, 2)
end_second = datetime.datetime(2018, 6, 30, 17, 51)
goal_time = [
datetime.datetime(2018, 6, 30, 16, 13),
datetime.datetime(2018, 6, 30, 16, 40),
datetime.datetime(2018, 6, 30, 17, 5),
datetime.datetime(2018, 6, 30, 17, 14),
datetime.datetime(2018, 6, 30, 17, 21),
datetime.datetime(2018, 6, 30, 17, 25),
datetime.datetime(2018, 6, 30, 17, 50)
]
goal_tweet = []
for goal in np.array(goal_time, dtype='datetime64[ns]'):
    for nb, time in zip(y, x):
        if abs((time - goal) / np.timedelta64(1, 's')) < 60:
            goal_tweet.append(nb)
            break
goal_team = ["Fr", "Arg", "Arg", "Fr", "Fr", "Fr", "Arg"]
goal_color = ["red" if team == "Fr" else "blue" for team in goal_team]
goal_player = ["Griezmann", "Di Maria", "Mercado", "Pavard", "Mbappe", "Mbappe", "Aguero"]
delta_x = [-0.020, -0.005, 0.010, 0.005, 0.005, 0.005, 0.01]
delta_y = [100, -200, -100, -200, -600, -650, 0.050]
fault = datetime.datetime(2018, 6, 30, 16, 10)
for nb, time in zip(y, x):
    if abs((time - np.array(fault, dtype='datetime64[ns]')) / np.timedelta64(1, 's')) < 60:
        tweet_fault = nb
        break
fig = plt.figure(figsize=(20, 12))
ax = fig.add_subplot(111)
plt.plot(x, y)
plt.axvline(x=start_first)
plt.axvline(x=end_first)
plt.axvline(x=start_second)
plt.axvline(x=end_second)
plt.axvspan(start_first, end_first, alpha=0.3, color='green', label="first half")
plt.axvspan(start_second, end_second, alpha=0.3, color='orange', label="second half")
plt.scatter(goal_time, goal_tweet, c=goal_color)
for i in range(7):
    ax.annotate(goal_player[i],
                xy=(md.date2num(goal_time[i]), goal_tweet[i]),
                xytext=(md.date2num(goal_time[i]) + delta_x[i], goal_tweet[i] + delta_y[i]),
                arrowprops=dict(facecolor=goal_color[i], shrink=0.05),
                color=goal_color[i],
                fontsize=20
                )
ax.annotate("Fault on Mbappe (Penalty)",
xy=(md.date2num(fault), tweet_fault),
xytext=(md.date2num(fault) - 0.001, tweet_fault -300),
arrowprops=dict(facecolor="black", shrink=0.05),
color="black",
fontsize=20
)
ax.xaxis.set_major_locator(dates.MinuteLocator(byminute=[0,15,30,45], interval = 1))
ax.xaxis.set_major_formatter(dates.DateFormatter('%H:%M'))
plt.ylabel("Number of Tweets", fontsize=15)
plt.title("Evolution of tweets posted during the match", fontsize=20)
plt.legend()
ax.grid(True)
plt.show()
A good way to find the start of each peak is to differentiate the curve and look for maxima. Once a peak is found, we can extract the corresponding period and see which tokens are the most frequent in it.
dy = []
dx = []
time_to_explore = []
for i in range(1, len(x) - 1):
    dy.append((y[i+1] - y[i]) / ((x[i+1] - x[i]) / np.timedelta64(1, 's')))
    dx.append(x[i] + (x[i+1] - x[i]) / 2)
fig, (ax1, ax2) = plt.subplots(2, 1, figsize = (20,12))
ax1.plot(dx, dy)
ax2.plot(x, y)
for i in range(1, len(dy) - 1):
    if dy[i] > 5:
        if dy[i] - dy[i-1] > 1:
            ax1.scatter(dx[i], dy[i])
            ax2.scatter(dx[i], y[i+1])
            time_to_explore.append((dx[i], dx[i] + np.timedelta64(360, 's')))
            plt.axvspan(dx[i], dx[i] + np.timedelta64(360, 's'), alpha=0.3, color='green', label="periods to explore")
plt.legend()
plt.show()
So now we have all the timeframes to check. We can simply count every token within each period and display the result as a word cloud too.
plt.figure(figsize=(20, 25))
for i, portion in enumerate(time_to_explore):
    sub_df = df[(portion[0] < df.Time) & (df.Time <= portion[1])]
    results_time = Counter()
    sub_df['tokens'].str.split("-").apply(results_time.update)
    for word in stopWords + ["argentina", "franc", "game"]:
        if word in results_time:
            del results_time[word]
    # use the per-period counter, not the global one
    wordcloud = WordCloud().generate_from_frequencies(results_time)
    plt.subplot(5, 2, i + 1)
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.title("from {:%H:%M} to {:%H:%M}".format(
        datetime.datetime.fromtimestamp(portion[0].astype(datetime.datetime) / 1e9),
        datetime.datetime.fromtimestamp(portion[1].astype(datetime.datetime) / 1e9)), fontsize=20)
    plt.axis("off")
plt.show()
Now we have a counter of words, and from here we can list all the ways each player is named. The first step is to extract from every tweet the names of all the players present.
def extract_name(x):
    to_add = []
    tokens = set(x.split("-"))
    for name, grammar in france_player.items():
        if len(set(grammar).intersection(tokens)) > 0:
            to_add.append(name)
    for name, grammar in argentina_player.items():
        if len(set(grammar).intersection(tokens)) > 0:
            to_add.append(name)
    return "-".join(to_add)
argentina_player = {
    "Messi": ["messi", "lionel", "mess", "leo"],
    "Dybala": ["dybala"],
    "Aguero": ["aguero", "sergio", "agüero"],
    "Higuain": ["higuain"],
    "Di Maria": ["di", "maria", "dimaria", "maría", "mariaa"],
    "Mascherano": ["mascherano"],
    "Caballero": ["caballero"],
    "Meza": ["meza"],
    "Pavon": ["pavon"],
    "Armani": ["armani"],
    "Otamendi": ["otamendi"],
    "Rojo": ["rojo", "marco"],
    "Perez": ["perez"],
    "Salvio": ["salvio"],
    "Banega": ["benega"],
    "Biglia": ["biglia"],
    "Acuna": ['acuna'],
    "Tagliafico": ["tagliafico"],
    "Lo Celso": ["lo", "celso", "lo celso", "locelso"],
    "Guzman": ["guzman"],
    "Mercado": ["mercado", "gabriel"],
    "Fazio": ["fazio"],
    "Ansaldi": ["ansaldi"]
}
france_player = {
    "Griezmann": ["griezmann", "antoin", "griezman"],
    "Pogba": ["pogba", "paul"],
    "Giroud": ["giroud"],
    "Mbappe": ["mbapp", "kylian", "mbappé", "mbappe", "mbape"],
    "Lloris": ["llori", "hugo", "lloris"],
    "Dembele": ["dembel", "dembele"],
    "Fekir": ["fekir"],
    "Pavard": ["pavard", "benjamin"],
    "Kante": ["kant", "kante"],
    "Matuidi": ["matuidi"],
    "Hernandez": ["hernandez"],
    "Varane": ["varan", "varane"],
    "Umtiti": ["umtiti", "samuel"],
    "Rami": ["rami"],
    "Thauvin": ["thauvin", "florian"],
    "Tolisso": ["tolisso"],
    "Mandanda": ["mandanda"],
    "Kimpembe": ["kimpemb", "kimpembe"],
    "Lemar": ["lemar"],
    "Mendy": ["mendy"],
    "Areola": ["areola"],
    "Sidibe": ["sidib", "sidibe"],
    "Nzonzi": ["nzonzi"]
}
X = df["tokens"].apply(extract_name)
Now we have a Series with, for every tweet, the player name(s) present. We can convert it to a one-hot encoded matrix.
X.head()
df_player = X.str.get_dummies("-")
df_player.head()
We can now extract how often each player is mentioned and build a ranking.
df_player.sum(axis=0)
score = Counter(dict(df_player.sum(axis=0)))
score_fr = Counter({key : value for key, value in score.items() if key in france_player})
score_arg = Counter({key : value for key, value in score.items() if key in argentina_player})
print("Most Frequent players from Argentina")
print(score_arg.most_common(11))
print("\nMost Frequent players from France")
print(score_fr.most_common(10))
x_arg, x_fr, y_arg, y_fr = [], [], [], []
name = []
for i, (player, num) in enumerate(score.most_common(20)):
    if player in france_player.keys():
        name.append(player)
        x_fr.append(i)
        y_fr.append(num)
    else:
        name.append(player)
        x_arg.append(i)
        y_arg.append(num)
fig, ax = plt.subplots(1, 1, figsize=(20,12))
p1 = ax.bar(x_fr, y_fr, color ="red")
p2 = ax.bar(x_arg, y_arg, color="blue")
ax.yaxis.set_tick_params(labelsize=12)
locs, labels = plt.xticks()
plt.xticks(range(20), name, rotation=90, fontsize=12)
plt.xlim(-0.5, 19.5)
plt.ylabel("Number of tweets", fontsize=15)
plt.title("Ranking of Players by number of tweets", fontsize=20)
ax.legend((p1[0], p2[0]), ('France', 'Argentina'), fontsize=15)
plt.show()
Instead of looking at individual players, we can aggregate by team.
labels = 'Argentina', 'France'
sizes = [sum(score_arg.values()), sum(score_fr.values())]
colors = ['blue', 'red']
explode = (0.05, 0.05)
fig = plt.figure(figsize=(8,8))
patches, texts , autotxt = plt.pie(sizes, labels=labels,
colors=colors,
explode=explode,
autopct='%1.1f%%',
shadow=True,
startangle=90)
for autotext in autotxt:
    autotext.set_color('white')
    autotext.set_fontsize(15)
for text in texts:
    text.set_fontsize(15)
plt.title("Number of tweets with \n a player name", fontsize=20)
plt.axis('equal')
plt.show()
We can see that the most mentioned player is Mbappe from France, but overall there are more tweets about Argentina.
To go into more detail per player, we also have the sentiment and the time of each tweet. As a result, we can look at several additional points.
To do so, we will have to group the DataFrame by time.
df_player_plus = df_player.join(df[["Time", "Sentiment"]])
agg_option = { x : ["sum", "mean"] for x in df_player_plus if x not in ["Time", "Sentiment"]}
agg_option["Sentiment"] = ["min", "max", "mean"]
agg_player = df_player_plus.groupby([df_player_plus.Time.dt.hour, df_player_plus.Time.dt.minute]).agg(agg_option)
agg_player.head()
For readability, we will look at those stats for only the top 6 players (Mbappe, Messi, Di Maria, Pavard, Pogba, Aguero).
We will plot the results below.
X = agg["min"]
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(20,36))
for col in name[:6]:
    ax1.plot(X, agg_player[(col, "sum")], label=col)
    ax2.plot(X, np.cumsum(agg_player[(col, "sum")]), label=col)
    ax3.plot(X, 100 * agg_player[(col, "mean")], label=col)
for ax in [ax1, ax2, ax3]:
    ax.xaxis.set_major_locator(dates.MinuteLocator(byminute=[0, 15, 30, 45], interval=1))
    ax.xaxis.set_major_formatter(dates.DateFormatter('%H:%M'))
    ax.legend(loc=2, fontsize=12)
ax1.set_ylabel("Number of tweets", fontsize=12)
ax2.set_ylabel("Number of tweets", fontsize=12)
ax3.set_ylabel("Percent of tweets", fontsize=12)
ax1.set_title("Number of tweets per minutes", fontsize=20)
ax2.set_title("Cumulative sum of tweets per minutes", fontsize=20)
ax3.set_title("Average of tweets having the name", fontsize=20)
yticks = mtick.FormatStrFormatter('%.0f%%')
ax3.yaxis.set_major_formatter(yticks)
plt.show()
We can see that Mbappe is nearly always at the top in terms of the number of times he is mentioned, but in relative terms, Di Maria holds the record of tweets per minute just after his goal: his name appeared in 45% of tweets during that window. We can also see a peak when Pavard scored, with around 23% of tweets containing his name.
For this part, we have to reshape the DataFrame again so that we have one column with the sentiment and one column with the player. Then we can compare them.
temp = df_player_plus[name[:10] + ["Sentiment"]]
temp = pd.melt(temp, id_vars="Sentiment", value_vars=name[:10])
temp = temp[temp.value != 0 ]
temp.head()
With a jitter plot, we can see the balance of sentiment and also the difference in the number of mentions (1 dot = 1 tweet where the player is mentioned).
plt.figure(figsize=(20, 12))
sns.stripplot('variable', 'Sentiment', data=temp, jitter=0.5, alpha = 0.6)
sns.despine()
plt.show()
Nevertheless, this is not very easy to read. If we want to see the trend, we should look at the distribution: the more it is shifted to the right, the more positive the sentiment.
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(20,20))
for player in name[:10]:
    sns.distplot(temp[temp.variable == player]["Sentiment"], hist=False, rug=False, label=player, ax=ax1)
    sns.distplot(temp[temp.variable == player]["Sentiment"], hist=False, rug=False, label=player,
                 hist_kws={'cumulative': True}, kde_kws={'cumulative': True}, ax=ax2)
ax1.set_title("Distribution of Sentiments", fontsize=20)
ax2.set_title("Cumulative distribution of Sentiments", fontsize=20)
plt.xlim(0, 1)
plt.legend(loc=2)
plt.show()
The player with the best average sentiment is the one whose curve is shifted furthest to the right; in the cumulative plot, this corresponds to the lowest curve. From this, we can rank the players by sentiment.
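One way to make this ranking explicit is to compute the mean sentiment per player from the melted DataFrame. This is a sketch using a small toy stand-in for the `temp` DataFrame built above (the values are illustrative, not the real data):

```python
import pandas as pd

# Toy stand-in for the melted DataFrame built above:
# one row per (player mention, sentiment score).
temp = pd.DataFrame({
    "variable": ["Mbappe", "Mbappe", "Messi", "Messi", "Pavard"],
    "Sentiment": [0.9, 0.7, 0.4, 0.6, 0.75],
})

# Average sentiment per player, best first.
ranking = temp.groupby("variable")["Sentiment"].mean().sort_values(ascending=False)
print(ranking)
```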
In this project, we used a few NLP tools (mainly during preprocessing) to get a more in-depth view of Twitter opinion about the players. We saw that the most mentioned players are not always the ones with only positive evaluations. We also looked at the evolution of tweet volume during the match, which could perhaps be used to train a model to detect when a goal is scored.